In this session, we will use the Black Friday Data available in [Kaggle] to study how to make the following graphical displays.
To understand customer purchase behavior across products in different categories, the retail company “ABC Private Limited”, in the United Kingdom, shared purchase summaries of its customers for selected high-volume products from last March. The data contains the following variables.
Rows: 550,068
Columns: 12
$ User_ID <dbl> 1000001, 1000001, 1000001, 1000001, 1000002…
$ Product_ID <chr> "P00069042", "P00248942", "P00087842", "P00…
$ Gender <chr> "F", "F", "F", "F", "M", "M", "M", "M", "M"…
$ Age <chr> "0-17", "0-17", "0-17", "0-17", "55+", "26-…
$ Occupation <dbl> 10, 10, 10, 10, 16, 15, 7, 7, 7, 20, 20, 20…
$ City_Category <chr> "A", "A", "A", "A", "C", "A", "B", "B", "B"…
$ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2", "2", "2…
$ Marital_Status <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0…
$ Product_Category_1 <dbl> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1,…
$ Product_Category_2 <dbl> NA, 6, NA, 14, NA, 2, 8, 15, 16, NA, 11, NA…
$ Product_Category_3 <dbl> NA, 14, NA, NA, NA, NA, 17, NA, NA, NA, NA,…
$ Purchase <dbl> 8370, 15200, 1422, 1057, 7969, 15227, 19215…
A bar chart is a graphical display well suited to a general audience. Here, we study the distribution of age groups among the company’s customers who purchased products on Black Friday. Usage: barplot(height, …)
A bar chart can be horizontal or vertical. Using the argument col, we can assign a color to the bars. The argument main can be used to set the title of the figure. We can also use RGB color codes to assign colors.
Note: The margins of a figure can be set using the mar argument of the par() function. The order of the settings is c(bottom, left, top, right).
Similarly, we can use a pie chart to study the distribution of the city category.
Usage: pie(x, …)
Tip: Use a color palette to choose colors (Google search: color scheme generator).
Histograms are used when we want to study the distribution of a quantitative variable. Here we study the distribution of customer purchase amounts.
Usage: hist(x, …)
In general, a boxplot is used when we want to compare the distribution of a quantitative variable across several groups. In the following example, we study the distribution of customer purchase amounts by sex and marital status.
When we want to study the relationship between two quantitative variables, a scatterplot can be used. Since this data set doesn’t have another quantitative variable, we will use the built-in data set mtcars in R. We then study the relationship between miles per gallon and vehicle weight.
Since the Black Friday data is not time series data, a line plot is not appropriate for it. In the following code chunk, we create a data frame using the forecasted highest temperatures from July 13th to July 22nd, 2022.
---
title: "Le Et Blah"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: sketchy
      navbar-bg: "darkmagenta"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(plotly)
library(DT)
Fri<-read_csv("Black_Friday.csv")
```
Brief Overview 1
===
Column {data-width=450}
---
In this session, we will use the Black Friday Data available in [Kaggle] to study how to make the following graphical displays.
Column {.tabset data-width=550}
---
### Graphical Displays
- Categorical Data
    - Bar Chart
    - Pie Chart
- Quantitative Data
    - Histogram
    - Box plot
    - Scatter plot
### Common Arguments
- col: a vector of colors
- main: a title for the plot
- xlim or ylim: limits for the x or y axis
- xlab or ylab: a label for the x or y axis
- font: font used for text, 1=plain, 2=bold, 3=italic, 4=bold italic
- font.axis: font used for the axis annotation
- cex.axis: font size for x and y axes
- font.lab: font for x and y labels
- cex.lab: font size for x and y labels
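
As a minimal sketch of how these arguments combine in a single call, the chunk below uses the built-in `mtcars` data rather than the Black Friday data. The colors, limits, label text, and the helper names `x_limits` and `y_limits` are illustrative choices only.

```{r args_sketch}
# Illustrative only: built-in mtcars data with arbitrary styling choices
x_limits <- c(1, 6)    # hypothetical x-axis range covering all weights
y_limits <- c(10, 35)  # hypothetical y-axis range covering all mpg values

plot(mtcars$wt, mtcars$mpg,
     col = "darkorchid",          # point color
     main = "Common Arguments",   # plot title
     xlim = x_limits,             # x-axis limits
     ylim = y_limits,             # y-axis limits
     xlab = "Weight (1000 lbs)",  # x-axis label
     ylab = "Miles per Gallon",   # y-axis label
     font.axis = 3,               # italic tick labels
     cex.axis = 0.8,              # smaller tick labels
     font.lab = 2,                # bold axis titles
     cex.lab = 1.2)               # larger axis titles
```

The `cex.*` arguments are multipliers relative to the default text size, so values below 1 shrink the text and values above 1 enlarge it.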
Brief Overview 2 {data-orientation=rows}
===
Row {data-height=100}
---
In this session, we will use the Black Friday Data available in [Kaggle] to study how to make the following graphical displays.
Row {data-height=900}
---
### Graphical Displays
- Categorical Data
    - Bar Chart
    - Pie Chart
- Quantitative Data
    - Histogram
    - Box plot
    - Scatter plot
### Common Arguments
- col: a vector of colors
- main: a title for the plot
- xlim or ylim: limits for the x or y axis
- xlab or ylab: a label for the x or y axis
- font: font used for text, 1=plain, 2=bold, 3=italic, 4=bold italic
- font.axis: font used for the axis annotation
- cex.axis: font size for x and y axes
- font.lab: font for x and y labels
- cex.lab: font size for x and y labels
Data
===
Column {data-width=550}
---
### <b><font size = 4><span Style = "color:darkslateblue">First 500 Observations</span></font></b>
```{r show_table}
datatable(Fri[1:500,],rownames = F, colnames = c("User ID", "Product ID", "Gender", "Age", "Occupation", "City Category", "Stay In Current City Years", "Marital Status", "Product Category 1", "Product Category 2", "Product Category 3", "Purchase"), options = list(pageLength = 20))
```
Column {data-width=450}
---
### <font size = 4><span Style = "color:darkslateblue">Description</span></font>
To understand customer purchase behavior across products in different categories, the retail company "ABC Private Limited", in the United Kingdom, shared purchase summaries of its customers for selected high-volume products from last March. The data contains the following variables.
- User ID: User ID
- Product ID: Product ID
- Gender: Sex of User
- Age: Age in Bins
- Occupation: Occupation (Masked)
- City_Category: Category of the City (A,B,C)
- Stay_In_Current_City_Years: Number of years stayed in current city
- Marital_Status: Marital Status
- Product_Category_1: Product Category (Masked)
- Product_Category_2: Product may belong to other category also (Masked)
- Product_Category_3: Product may belong to other category also (Masked)
- Purchase: Purchase Amount
```{r}
glimpse(Fri)
```
Bar Chart {data-orientation=rows}
===
Row {data-height=350}
---
###
A bar chart is a graphical display well suited to a general audience. Here, we study the distribution of age groups among the company's customers who purchased products on Black Friday.
**Usage:** barplot(height, ...)
A bar chart can be horizontal or vertical. Using the argument <span Style="color:darkred">col</span>, we can assign a color to the bars. The argument <span Style="color:darkred">main</span> can be used to set the title of the figure. We can also use RGB color codes to assign colors.
**Note:** The margins of a figure can be set using the mar argument of the <span Style="color:darkred">par()</span> function. The order of the settings is <span Style="color:darkred">c(bottom, left, top, right)</span>.
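
The options described above can be sketched in one base-R call: `horiz = TRUE` for horizontal bars, `rgb()` for an RGB color, and `par(mar = ...)` for the margins. This is only a sketch on the built-in `mtcars` data, not the Black Friday data, and the names `cyl_counts`, `bar_col`, and `old_par` are hypothetical.

```{r bar_sketch}
# Stand-in data: number of cars per cylinder count in the built-in mtcars data
cyl_counts <- table(mtcars$cyl)

# rgb() builds a color from red, green, and blue components (0-255 scale here)
bar_col <- rgb(3, 115, 73, maxColorValue = 255)

old_par <- par(mar = c(5, 7, 4, 2))  # margins as c(bottom, left, top, right)
barplot(cyl_counts,
        horiz = TRUE,                # draw horizontal instead of vertical bars
        col = bar_col,
        main = "Cars by Number of Cylinders",
        xlab = "Number of Cars",
        ylab = "Cylinders",
        las = 1)                     # keep axis tick labels horizontal
par(old_par)                         # restore the previous margins
```

Saving the value returned by `par()` and restoring it afterwards keeps the margin change from leaking into later plots drawn on the same device.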
### Analysis
Row {data-height=650}
---
### **Vertical Bar Chart**
```{r bar1}
par(mgp = c(4, 1, 0)) # margin lines for the axis title, tick labels, and axis line
par(mar = c(5, 7, 4, 2)) # figure margins: c(bottom, left, top, right)
barplot(table(Fri$Age),
        col = "darkorchid",
        main = "Distribution of Purchases by the Customer's Age",
        ylab = "Number of Purchases",
        xlab = "Age Group")
```
### **Horizontal Bar Chart**
```{r bar2}
# Build the ggplot first, then convert it to an interactive plotly widget.
# Base-graphics par() settings do not apply to ggplot2, so none are set here.
bar1 <- Fri %>%
  ggplot(aes(x = Age)) +
  geom_bar(fill = "#037349") +
  coord_flip() +
  labs(title = "Distribution of Purchases by Customer's Age",
       x = "Age Groups",
       y = "Number of Purchases")
ggplotly(bar1)
```
Pie Chart
===
Column {data-width=500}
---
Similarly, we can use a pie chart to study the distribution of the city category.
**Usage:** pie(x, ...)
**Tip:** Use a color palette to choose colors (Google search: color scheme generator).
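
As an alternative to hand-picked hex codes, R can generate a palette of any length. The sketch below assumes R 3.6.0 or later for `hcl.colors()` and uses the built-in `mtcars` data as a stand-in; the names `gear_counts`, `gear_pct`, `gear_labels`, and `gear_cols` are hypothetical.

```{r pie_sketch}
# Stand-in data: counts of cars by number of forward gears (built-in mtcars)
gear_counts <- table(mtcars$gear)

# Percentage of the total for each slice, rounded to one decimal place
gear_pct <- round(100 * gear_counts / sum(gear_counts), 1)
gear_labels <- paste0(names(gear_counts), " gears: ", gear_pct, "%")

# One palette color per slice, so the code adapts to the number of categories
gear_cols <- hcl.colors(length(gear_counts), palette = "Viridis")

pie(gear_counts,
    labels = gear_labels,
    col = gear_cols,
    main = "Cars by Number of Gears")
```

Because the labels carry both the category and its percentage, a separate legend is not needed in this variant.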
### Analysis
Column {data-width=500}
---
### Distribution of City Category
```{r pie}
H <- table(Fri$City_Category)
percent <- round(100*H/sum(H),1)
pie_labels <- paste(percent, "%", sep="")
colpie <- c("#6E0D25","#774E24","#DCAB6B")
pie(H,main="Distribution of City Category", labels = pie_labels, col = colpie)
legend("topright", c("A","B","C"), cex = 0.8, fill = colpie)
```
Histogram
===
Column {data-width=500}
---
###
Histograms are used when we want to study the distribution of a quantitative variable. Here we study the distribution of customer purchase amounts.
**Usage:** hist(x, ...)
```{r histogram}
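# bins = 30 matches the ggplot2 default; setting it explicitly silences the message
Fri %>% ggplot(aes(x = Purchase)) +
  geom_histogram(bins = 30, fill = "darkslategray") +
  labs(title = "Distribution of Customer Purchase Amounts",
       x = "Purchase Amount (British Pounds)")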
```
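The chunk above uses ggplot2, while the usage line names the base-R function `hist()`. A minimal base-R sketch follows, using the built-in `faithful` data as a stand-in for the purchase amounts; the names `waiting` and `h` are hypothetical.

```{r hist_sketch}
# Stand-in data: waiting times between eruptions (built-in faithful data)
waiting <- faithful$waiting

# hist() draws the plot and invisibly returns the bin breaks and counts
h <- hist(waiting,
          breaks = 20,            # suggested number of bins; hist() may adjust it
          col = "darkslategray",
          border = "white",
          main = "Distribution of Waiting Times",
          xlab = "Waiting Time (minutes)")
```

The returned object can be inspected, for example through `h$breaks` and `h$counts`, when the exact bin boundaries matter.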
Column {data-width=500}
---
### Analysis
Boxplot
===
Column {.tabset data-width=550}
---
### Boxplot 1
#### B1
```{r boxplot1}
boxplot(Fri$Purchase, xlab="Purchase Amount", ylab="British Pounds")
```
### Boxplot 2
#### B2
```{r boxplot2}
boxplot(Purchase ~ Gender + Marital_Status, data = Fri, main = "Distribution of Purchase by Sex and Marital Status",
        xlab = "Sex and Marital Status", ylab = "Purchase", cex.lab = 0.75, cex.axis = 0.5,
        names = c("Female & Single", "Male & Single", "Female & Married", "Male & Married"))
```
In general, a boxplot is used when we want to compare the distribution of a quantitative variable across several groups. In the example above, we study the distribution of customer purchase amounts by sex and marital status.
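
The formula interface `y ~ group` is what splits a quantitative variable into one box per group. A small sketch on the built-in `mtcars` data shows the same idea with a single grouping variable; the name `bp` is hypothetical.

```{r boxplot_sketch}
# Stand-in data: miles per gallon grouped by cylinder count (built-in mtcars)
bp <- boxplot(mpg ~ cyl, data = mtcars,
              col = c("#6E0D25", "#774E24", "#DCAB6B"),
              main = "Miles per Gallon by Number of Cylinders",
              xlab = "Cylinders",
              ylab = "Miles per Gallon")

# boxplot() invisibly returns the plotted statistics:
# bp$names holds the group labels and row 3 of bp$stats holds the medians
group_medians <- setNames(bp$stats[3, ], bp$names)
```

Keeping the returned statistics makes it easy to quote the group medians in the analysis text instead of reading them off the plot.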
Column {data-width=450}
---
### Analysis of Boxplot 1
### Analysis of Boxplot 2
Scatterplot
===
Column {data-width=500}
---
###
When we want to study the relationship between two quantitative variables, a scatterplot can be used. Since this data set doesn't have another quantitative variable, we will use the built-in data set <span Style="color:darkred">mtcars</span> in R. We then study the relationship between miles per gallon and vehicle weight.
```{r scatterplot}
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon", # wt is recorded in units of 1000 lbs
     pch = 19, col = "deeppink4")
```
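One possible way to make the relationship easier to read is to overlay a fitted straight line and report the correlation. The sketch below assumes a simple linear fit is an adequate summary, which the original does not claim; the names `fit` and `r` are hypothetical.

```{r scatter_sketch}
# Fit a simple linear regression of mpg on weight (an illustrative assumption)
fit <- lm(mpg ~ wt, data = mtcars)

plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     pch = 19, col = "deeppink4")
abline(fit, lwd = 2)  # add the fitted line to the existing scatterplot

# Pearson correlation between weight and miles per gallon
r <- cor(mtcars$wt, mtcars$mpg)
```

A negative slope and a negative correlation both indicate that heavier cars tend to travel fewer miles per gallon.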
Column {data-width=500}
---
### Analysis
Line Plot
===
Column {.tabset data-width=350}
---
### Data
Since the Black Friday data is not time series data, a line plot is not appropriate for it. In the following code chunk, we create a data frame using the forecasted highest temperatures from July 13th to July 22nd, 2022.
```{r data}
Date <- 13:22
Dayton_OH <- c(84,86,91,89,89,91,92,91,91,91)
Houston_TX <- c(100,97,96,94,94,94,93,93,92,91)
Denvor_CO <- c(95,85,89,96,97,96,92,91,95,96)
Fargo_ND <- c(86,80,84,87,90,87,83,84,87,89)
df <- data.frame(Date,Dayton_OH,Houston_TX,Denvor_CO,Fargo_ND)
datatable(df, rownames = FALSE, colnames = c("Date", "Dayton, OH", "Houston, TX", "Denver, CO", "Fargo, ND"))
```
### Analysis
Column {data-width=650}
---
### Line Chart
```{r line1}
plot(Date, Dayton_OH, type = "o", col = "#9F9AA4", xlab = "Date in July", ylab = "Highest Temperature", ylim = c(80, 100))
lines(Date, Houston_TX, type = "o", col = "#E7CFCD")
lines(Date, Denvor_CO, type = "o", col = "#B5C9C3" )
lines(Date, Fargo_ND, type = "o", col = "#CAB1BD")
legend("topright", legend = c("Dayton, OH", "Houston, TX", "Denver, CO", "Fargo, ND"),
       col = c("#9F9AA4", "#E7CFCD", "#B5C9C3", "#CAB1BD"),
       lty = 1,
       pch = 1)
```
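When several series share the same x values, `matplot()` can draw them in one call instead of one `plot()` followed by repeated `lines()`. The sketch below repeats the forecast values from the data chunk so that it runs on its own; the names `temps`, `line_cols`, and `city_labels` are hypothetical.

```{r line_sketch}
# Same forecast values as the data chunk, repeated so this chunk is self-contained
temps <- data.frame(
  Date       = 13:22,
  Dayton_OH  = c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91),
  Houston_TX = c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91),
  Denver_CO  = c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96),
  Fargo_ND   = c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)
)
line_cols <- c("#9F9AA4", "#E7CFCD", "#B5C9C3", "#CAB1BD")
city_labels <- c("Dayton, OH", "Houston, TX", "Denver, CO", "Fargo, ND")

# matplot() plots each column of the matrix as its own line
matplot(temps$Date, as.matrix(temps[, -1]),
        type = "o", lty = 1, pch = 1, col = line_cols,
        xlab = "Date in July", ylab = "Highest Temperature")
legend("topright", legend = city_labels, col = line_cols, lty = 1, pch = 1)
```

Because `matplot()` computes the y-axis range from all columns at once, no manual `ylim` is needed to keep every series inside the plot.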